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Abstract. Sparsity has become a key concept for solving of high-dimensional 
inverse problems using variational regularization techniques. Recently, using 
similar sparsity-constraints in the Bayesian framework for inverse problems by 
encoding them in the prior distribution has attracted attention. Important 
questions about the relation between regularization theory and Bayesian inference 
still need to be addressed when using sparsity promoting inversion. A practical 
obstacle for these examinations is the lack of fast posterior sampling algorithms 
for sparse, high-dimensional Bayesian inversion: Accessing the full range of 
Bayesian inference methods requires being able to draw samples from the posterior 
probability distribution in a fast and efficient way. This is usually done using 
Markov chain Monte Carlo (MCMC) sampling algorithms. In this article, we 
develop and examine a new implementation of a single component Gibbs MCMC 
sampler for sparse priors relying on Ll-norms. We demonstrate that the efficiency 
of our Gibbs sampler increases when the level of sparsity or the dimension of the 
unknowns is increased. This property is contrary to the properties of the most 
commonly applied Metropolis-Hastings (MH) sampling schemes: We demonstrate 
that the efficiency of MH schemes for Ll-type priors dramatically decreases when 
the level of sparsity or the dimension of the unknowns is increased. Practically, 
Bayesian inversion for Ll-type priors using MH samplers is not feasible at all. 
As this is commonly believed to be an intrinsic feature of MCMC sampling, 
the performance of our Gibbs sampler also challenges common beliefs about the 
applicability of sample based Bayesian inference. 
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1. Introduction 

1.1. Sparse Bayesian inversion 

Solving high-dimensional inverse problems using sparsity constraints as a priori 
information has led to enormous advances in various application areas. Total variation 
(TV) deblurring |471 E] uses sparsity constraints on the gradient of the unknown 
quantity and is successfully used in many imaging applications. By the notion of 
compressed sensing [141 115 ] . a number of techniques are summarized, which rely on 
the idea that high quality reconstructions can be obtained from a small amount of data, 
if a sparse basis for the unknowns is a priori known. Traditionally, sparsity constraints 
are formulated in the framework of variational regularization when introduced as a 
priori information to inverse problems. One popular approach is to use regularization 
functionals incorporating LI norms. However, this type of sparsity constraints can also 
be formulated and examined in the framework of Bayesian statistics (33J . Addressing 
high dimensional, ill-posed inverse problems as problems of Bayesian inference has 
gained growing attention over the years |501 I2"U1 13T1 130] . It allows an easy formulation 
of a priori information on the solution via a priori probability distributions (prior). 
Furthermore, a specific inference strategy that is called the maximum a posteriori 
estimate (MAP) corresponds to variational regularization (the prior corresponds to 
the regularization functional) . In general, the Bayesian solution to an inverse problem 
is given by the a posteriori probability distribution (posterior) over the parameter space 
(the MAP estimate is the point maximizing this distribution) . The analysis of sparse 
Bayesian inversion is far less elaborate up to now, and a number of exciting questions 
still remain to be addressed. In particular, sparse inversion is an interesting topic 
to study the relation between regularization theory and Bayesian inference. This 
relation is well understood for regularization using L2 norms, which corresponds 
to Bayesian inference with Gaussian priors, see, e.g., [3T]. While the differences 
in these scenarios are subtle, they become way more pronounced in the context of 
sparse inversion using Ll-type priors, i.e., priors, which rely on LI norms of the 
unknowns |351 [35] . A central tool to study these differences is the examination of the 
posterior by Monte Carlo sampling methods. The standard sampling techniques were 
designed for Bayesian inference in low-dimensional, well-posed problems and often fail 
when used in scenarios arising from typical inverse problems. For these reasons, a 
number of specific sampling techniques for ill-posed, high-dimensional problems have 
already been developed [28]|55J[53J|1T]|31]. However, these techniques mainly address 
Gaussian priors. For the sparsity-promoting Ll-type priors they may fail dramatically. 
This observation is in line with the fact that efficient optimization techniques for the 
corresponding variational regularization schemes are still a vital field of research as 
well [21] [5J. 

1.2. Contributions and Structure 

Our work was motivated by questions that arise when Bayesian inference using 
general Ll-type priors is applied to edge preserving image reconstruction, similar to 
the scenarios discussed in |.36l 135] 133] . A major problem we and others faced was 
that the conventional Markov chain Monte Carlo (MCMC) tools for sampling the 
posterior distribution fail in such situations, rendering many examinations infeasible. 
Therefore, we develop and examine new and more efficient implementations of 
sampling algorithms for these situations first. In this paper, we present a fast 
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implementation of a Gibbs sampling algorithm and study its performance in two 
typical inverse problems scenarios. Thereby we provide a solid basis for addressing 
more sophisticated questions in sparse Bayesian inversion in the future. 
In Section [2] we describe the setting and methods used. Detailed numerical 
examinations of all MCMC algorithms for two test scenarios are presented in Section[3] 
In Section^] the results are discussed and we point to future directions of development. 
Additionally implementation details and code for the new sampling algorithms is 
provided in |Appendix A| 



2. Methods 



In this section, we will first introduce the general setting for our examinations (Section 



2.1 1 and the basics of Bayesian inference. Then we review the basic principles of 



MCMC-based posterior inference and present the most popular MCMC sampling 
schemes, the Metropolis-Hastings algorithm and the Gibbs sampling algorithm 
(Section 2.2 1. The intention of these first two sections is to make the article more 
accessible for readers which manly used variational regularization techniques so far 
and have little experience with Bayesian techniques. The more advanced reader can 
skip these sections. Section [2 . 3| contains the main contributions of this article, i.e., the 
development of a Gibbs sampling scheme for Ll-type priors which relies on a robust 
numerical implementation of an exact, explicit sampling from the conditional single 
component posterior. In the last section (Section 2.4 1, we will explain the methods 
used for the evaluation of the sampling performance in the computational studies. 
The experienced reader may, again, skip this section. 



2.1. General Setting and Bayesian Formulation 

In general, we consider the inverse problem of solving a continuous, linear, ill-posed 
operator equation. Here, we start from the following discrete model chosen for 
obtaining a computational solution (the computational model): 

m = Au + e, (1) 

where m £ M. k represents the given measurement data, u G R" represents the 
unknowns derived from a discretization of the computational domain, A € Mr xn is 
the discretization of the continuous forward operator with respect to the domains of u 
and m and e € K fc is an additive, stochastic noise term. Accounting for the stochastic 
nature of the noise term renders into a relation between the fc-dim random variables 
M and £ (the likelihood model): 

M = Au + £ (2) 

See |2U1 [3] for details on the implications of this step. For simplicity, we assume 
£ ~ A/"(0, cr 2 Ik), o > here, where Ik is the fc-dim identity matrix (the extension to 
general Gaussian noise is straight forward). Now, the conditional probability density 
of M given u is determined by Q and is, thus, called the likelihood density: 

ft 

pu(m\u) = ( ^2 J exp(--^\\m- Au\\j\ (3) 

Due to the ill-posedness of Q, inference about u given M on the basis of ^ is not 
feasible with standard statistical inference strategies. Bayesian inference strategies 
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rely on considering u as a random variable itself (U in our notation) and on encoding 
a priori information about U in its density, p pr (u), which is therefore called the prior. 
Then, the model can be inverted using B ayes' rule: 

p H (m\u)p pr (u) 

Ppost{u\m) = — ^ (4) 

p(m) 

The conditional density of U given M is called the posterior. In Bayesian inference, 
this density is the complete solution to the inverse problem. The term p(m) is called 
the model- evidence and for our aims, it is just a normalizing constant, which is of 
no further importance. There are several ways to exploit the information about U 
contained in the posterior. The most popular one, called the maximum a posteriori 
estimate (MAP), is to infer a point estimate for U by searching for the highest mode 
of the posterior. Another way to obtain a point estimate, called the conditional mean 
estimate (CM), is to compute the mean/expected value of the posterior: 

«map := argmax { p pos t(u\m)} (5) 

ttGR" 

ucm := E [u\m] = J u p post (u\m) du (6) 

Practically, computing the MAP estimate is a high-dimensional optimization problem, 
whereas computing the CM estimate is a high-dimensional integration problem. 
Apart from point estimates, computing confidence intervals, conditional covariance 
or histogram estimates are other applications of posterior-based inference. See |31| 
for an overview and, e.g., |22l [251 [7] for the applications to remote sensing, algae 
population dynamics and image deblurring. 

This far, we did not specify the concrete form of the prior p pr (u), which is actually 
the most important step within the Bayesian formalism. A common choice linking 
Bayesian inference with variational regularization is given by Gibbs distributions: 

p pr (u) oc exp (~XJ(u)) (7) 

Here, rj(u) is an energy functional penalizing unwanted features of u, and A > is a 
scaling parameter that is called the regularization parameter. Now, after suppressing 
terms not dependent on u, the MAP estimate is given by 

"map = argmax {exp ( -— A\m - Au\\l - XJ(u] 
tiGR™ I V ® 

— argmin {\\m- Au\\l + (2a 2 \) J(u)} (8) 
This is a Tikhonov-type regularization of equation ([lj) |17| . 

In this article, we only consider Gibbs priors with a LI norm type energy functional: 

J(u) = \Du\, (9) 

where D g IR' X ", and | • | denotes the LI norm in Mr. Although such priors may seem 
like a generic extension of the one dimensional Laplace distribution to a multivariate 
setting, we note here that multivariate generalizations of Laplace distributions are 
commonly defined in a different way (see, e.g. |16|L Concrete examples of Ll-type 
priors will be given in Section [3j For the methods presented here, we require I ^ n, 
rank(D) = I and ker(Z?) n kcr(A) = 0. In forthcoming work, we will extend the sam- 
pler proposed in Section [23] to more general settings. 
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Remark: For the sake of an intuitive presentation of the Bayesian formulation 
of inverse problems, we started from a deterministic setting. For a more detailed 
and rigorous description on how to derive a discrete, computational model of the 
continuous inverse problem starting in the Bayesian framework, we refer to |35| . 

2.2. Posterior Inference using MCMC Sampling 

General Principles: In typical inverse problems scenarios, the dimension n of the 
unknowns u is very large (in our computational examples, we will study a scenario 
where the limit n —¥ oo is of central interest). Therefore, the integration to compute 
the CM estimate ^ is intractable by means of traditional quadratures. Interval, 
conditional covariance and histogram estimates and even more sophisticated topics 
in Bayesian inference like marginalization, model selection or experiment design |52| 
also rely on integration tasks and can, thus, not be computed by such an approach 
as well. Integration by Monte Carlo methods can avoid these difficulties. A sequence 
of points Ui, i = 1, . . . , K is constructed, which is distributed like the posterior (the 
construction schemes are called sampler or sampling methods). If they were drawn 
independently, the law of large numbers would guarantee that 

4 Kj ^° E[/(«)|m]= / f(u) Ppost (u\m)du (10) 

for any measurable / almost surely and in LI with rate 0(K~ x l 2 ). This means that 
the empirical mean of the sequence /(itj), i = 1,...,K converges to the expected 
value of f(u) w.r.t the posterior |32j . A difficulty in our setting is that the posterior 
is not given in a form that allows for drawing independent samples. It is only known 
up to a normalizing constant (the model-evidence) and does not belong to a class of 
distributions for which independent sampling schemes are known. However, by the 
strong ergodic theorem, the above convergence (and its rate) still holds if the sequence 
is dependent, but originates from an ergodic Markov chain that has p p0 st(u\m) as its 
equilibrium distribution |32| . Techniques to construct such chains are called Markov 
chain Monte Carlo (MCMC) methods. A huge number of different MCMC methods 
have been proposed. However, no method is known, which exhibits a good performance 
for all types of distributions. For a comprehensive overview, we refer to |37| . for the 
application to inverse problems, see [3T] . Most MCMC methods rely on one of two 
basic sampling schemes, which we will introduce in the next sections. Instead of 
comparing all possible and sophisticated variants of these schemes in our studies, we 
will use a small number of simple variants and focus on the differences between the 
two basic schemes for Ll-type priors. 

Metropolis-Hastings Sampling: For the ease of presentation, we denote the target 
probability density we want to sample by p(x), x £ K n . The Metropolis-Hastings 
(MH) algorithm [39, 25] is a very simple rule to generate a Markov chain: 

Algorithm 1. (Metropolis-Hastings Sampling) Let q(x,y) : 1" x R" — > R + be a 

function satisfying J q(x,y)dy = 1 for all x £ R" (proposal distribution) and Xq £ K™ 
an initial state. Define burn-in size Kq and sample size K . 
For i = 1,.. .,K + K do: 

1 Draw y from the proposal distribution q(xi-\, y). 
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r(xi-x,y) = min 1 



2 Compute the acceptance ratio 

p(y) q(y, Xj-i) 

p(af»-i) q{%i-i,y) 

3 Draw 9 £ [0, 1] from a uniform probability density. 

4 If r J? 9 , set Xi = y, else set Xi — Xi-\. 

Return xk +i , • ■ • , Xk ■ 

Note that the restrictions on p{x) for this scheme are minimal: We only have to 
know p(x) up to a scaling factor, as only ratios of probabilities are used, and we only 
need to be able to evaluate p(x) for any given x. Each sampling step requires one such 
evaluation (in inverse problems, the computational demanding part of this evaluation 
is usually applying the forward mapping A). The numerical implementation of the 
raw MH scheme is trivial. MH can, thus, be considered as a "black-box sampler", 
which explains its success in many different application areas |37j . 
However, while the scheme works for all kinds of proposal distributions in theory, 
its application is only feasible if q(x,y) leads to a chain that moves "fast" in the 
sampling space with respect to computational speed. This way, the important regions 
of the sampling space are explored reasonably fast and consecutive samples are as 



uncorrelated as possible (which improves the convergence in ( 10 )). These requirements 
are hard to fulfill in practice. We refer to the discussions in |3"Tl [37]. Usually, 
one ends up in a well-known dilemma of tuning different opposing parameters by 



manual inspection of different chain characteristics (see Section 2.4). Additionally, 
different applications usually require to develop and implement specific proposal 
distributions. As a consequence, a huge number of different MH-based schemes exist 
|37| . Especially for inverse problems, sophisticated algorithms that include automatic 
tuning procedures for the sampling parameters have been developed [2"2l |2"51 [5]. 
However, as mentioned earlier, a detailed comparison of their performance in sparse 
Bayesian inversion is not the topic of this publication. We rather want to compare 
the basic variants of MH algorithms and their performance to basic variants of Gibbs 
sampling algorithms. Therefore, we will use three proposal distributions that are 
commonly applied in practice because of their simplicity. They all belong to the class 
of symmetric random-walk Metropolis schemes |37| : 

y = x + d, E(0)=O, p6(u) «s(|M| a )Vw € R n , (11) 

for a suitable, non-negative function g. This means that a new proposal y is generated 
by perturbing the current state x in a random, unbiased, symmetric way. Thus, 
q(x,y) oc g(\\x — y\\), and q vanishes from the acceptance ratio (|T|). The three choices 
of $ we will use are: 

1) MH-Iso: All components of x are updated: ~ A/"(0, k 2 ), Vi. 

2) MH-Ncom: 1 < n* < n components i±, ...,i n * of x are randomly chosen and 

are updated while the other components remain unchanged: di ^ Af(0,K 2 ). if 
i € {h, . .. ,i n *}) else d l = 0. 

3) MH-Si: One component i* of x is randomly chosen and updated while all other 

components remain unchanged: i?^ ~ A/"(0, k 2 ), !?[_»,] = 0. 

Here, i?r_j] denotes all components of $ except the i th one. The concrete choice of n* 
and k will be explained in Section [3] 
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Gibbs Sampling: In certain scenarios, direct sampling of a n-dim multivariate 
distribution is not possible or computationally too expensive, but direct sampling 
from conditioned (thus, lower dimensional) versions of that distribution is feasible. In 
such a situation, Gibbs sampling can be applied. By successive sampling from the 
lower dimensional conditional distributions while changing the coordinates, which are 
fixed in each step, a Markov chain is generated |20| 1TB] . The most basic scheme is 
given by: 

Algorithm 2. (Single Component Gibbs Sampling) Let xq E R n an initial state. 
Define burn-in size Kq and sample size K 
For i=l,.. .,K + K do: 

For j = 1,. . .,n do: 

1 Set s — j (systematic scan) or draw s randomly from {l,...,n} (random 
scan). 

2 Draw (xi) s from the conditional, 1-dim density p(- 
Return xk +i , • ■ • , xk ■ 

We will abbreviate the systematic scan version of the above sampler as SysGibbs 
and the random version (which requires the extra computational effort of picking a 
random coordinate) as RnGibbs. 

The basic Gibbs sampling scheme can be very slow if the correlations between the 
single components Xi are strong. This occurs naturally in typical under-determined 
inverse problems. In this case, the conditional distributions differ considerably from 
the corresponding marginal ones. As a consequence, the chain moves very randomly, 
exploring the search space very slowly. To address this problem, overrelaxed variants 
of Gibbs sampling have been proposed. The idea behind them are similar to those used 
in overrelaxation techniques for the iterative solution of systems of linear equations 
|48| . The specific form of overrelaxation that we will apply and examine was proposed 
in [43], and relies on order statistics. Step 2 in Algorithm [2] is replaced by: 

Algorithm 3. (Ordered Overrelaxation) 

2.1 Draw No random values from the conditional, 1-dim density p( ■ \(xi)[_ s ]), where 
N 6N is odd. 

2.2 Arrange these No values plus the old value (xi) s in non- decreasing order, labeling 
them as follows: 

(x t )^ < ( Xi )W < < = < . < ( Xi )("o) (12) 

2.3 Replace ( Xi ) s by (x 2 ) { s N °~ t} . 

The value of No functions like an overrelaxation parameter. The larger the 
value of No, the larger the effect of overrelaxation and more randomness of the 
sampling process is suppressed. For symmetric densities, the current value of the 
component is mirrored at the mean and the whole chain moves on an iso-probability 
level of the density in the limit of No — > oo. We will discuss more details of ordered 
overrelaxation in Section |3.1.5| We will denote the overrelaxed versions of SysGibbs 
and RnGibbs by appending "OiVo", e.g., "SysGibbs07" denotes the systematic scan 
Gibbs Sampler with ordered overrelaxation using No = 7. While the basic scheme 
for ordered overrelaxation requires No times more computation time compared to 
Algorithm [3j an efficient implementation is given in |43| that renders the computation 
time nearly independent of No- We will present this form after the next paragraph. 
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2.3. Implementation of Gibbs Sampling for LI -type Priors 

In this section, the main contributions of this article are presented. The general Gibbs 
sampling schemes (Algorithms [2] and [3| need to be implemented in an efficient way. For 
this, we will first derive a way to compute a simple representation of the conditional 
single component density and then explain how to implement an exact, explicit and 
numerically robust sampler for it. 



Conditional Densities for Ll-type Priors: We will now derive the single component 
conditional densities required by Algorithms [2] and [3] for our setting (cf. Section 2.1). 



Because rank(D) = I (cf. Section 2.1 )., we can find v\, . . . , vi £ MP such that Dvi — e.;. 
(where e, denotes the i th unit vector in M 1 ) and fz+i, . . . , v n £ M." such that v%, . . . , v„ 
form a basis of K™. Then, we have 

n I 

u = ^2^m, Oii=(6,-,(l)' and |Du| = ^|&|. 

i=l i=l 

With V := [v\, . . . , v n ], we can transform the posterior to: 

exp (-j^sWrn- Au\\% - \\Du\ \ = exp ( -^\\m - AV£\\l - A^ |&| ) 



:= expl-||m-¥£||a - A E<^l) > ( 13 ) 

where m := m/(y/2a) and \1/ := (AV)/(\/2a). Because the transformations are 
linear, no specific attention to the correct transformation of probability densities has 
to be paid. Now let tpi be the i th column of be ^ without the i th column and 

£r_,-i be £ without the i th entry. Then 

m - * £ = (m - *[_,;] £[_i]) - fa & := ip[_q - ipi (14) 

Consider the conditional posterior of given rh and 

P(€i\m, £[_»]) oc exp ( — II — <| - ^t&lli ~ • 

= exp (- ((/?[_,-] - ^i£i, </J[-i] - - A l&l ' l{is£0) 

cx exp (-HVilllC- + 2^V M 6 - A|6| • (15) 
To ease the following presentation, we define: 

x := e t ; a := b := 2^V H] - 2 m - * M ) £ H] ] ; c := A • l {i<0 (16) 

Thus, the problem of sampling from the single component conditional densities is 
reduced to sampling from the 1-dim density 

p(x) oc exp(— ax 2 + bx — c \x\), (17) 



once a, b and c have been computed by (16). In the next paragraph, we will describe 



how to use the inverse cumulative distribution method [31 to sample from (17). 
Concerning the practical implementation of computing a, b and c in a fast way, note 
that a and c can be precomputed and only b depends on the current state of the 
chain £ through the term (ipj ^[-i]) £[—»]■ The most efficient way to compute this term 
strongly depends on the form of A and V, on the problem size n and on the hardware 
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available. If enough working memory is available to store the nx n matrix $ := ^'W, 
the most efficient way is to compute 

(ti* l - i] )z l - i] =e®(.,i)-ZiUi\\i (is) 

because the most extensive operation is a scalar product of dimension n. In the 
scenario examined in |35| . A is a symmetric convolution operator and V an inverse 
wavelet transform, i.e., Vj are the wavelets. For large k and n (as encountered, e.g., in 
2D or 3D imaging applications) , it is infeasible to compute and store the matrix form 
of A, V or Then, it is advantageous to use 

M *H]K M - 4 m) - ZiWiWl = ^2 ( A ■ Ve J ( AV - Willa 

= ~[A- (Vet)]* [A (V0\ - ZiWiWZ = ~v\ [(A* A) ■ (V$] - &||^||». (19) 

Here, (V£) can be realized using the fast wavelet transform, while the double 
convolution by (A 1 A) can be substituted by a single convolution with a different 
kernel and realized by the fast Fourier transform. 



Explicit ID Sampling: Sampling from continuous 1-dim distributions by the inverse 
cumulative distribution method follows a simple rule: Let F(y) := p(x)dx be the 
cumulative distribution function (cdf) and r be a random number uniformly drawn 
from [0, 1]. Then, y — F _1 (r) is distributed like p(x) (see [37]). We can also use this 
concept to provide an equivalent implementation of Algorithm [3] For for a given No ■ 

Algorithm 4. (CDF Implementation of Ordered Overrelaxation) 

2.1 Compute r = F [(xi) s ], which lies in [0, 1]. 

2.2 Let r' be the random ordered overrelaxation of r w.r.t to the uniform distribution 
on [0, 1] and No (computed with Algorithm^ . 

2.3 Replace { Xi ) s by F -1 (r'). 

For more details, we refer to |43j . Turning these rules into efficient sampling 
schemes requires a fast and stable way to invert F, which is defined by an integral. 
Using numerical integration for this purpose often fails to render fast and robust 
sampling algorithms. In this paragraph, we will present a scheme that relies on 
the inverse complementary error function (erfcinv) for which efficient and stable 
implementations are known. 

First, we compute the normalization factor for p(x) cx exp(— ax 2 +bx — c \x\). Splitting 
the integral from — oo to oo into two parts (from — oo to and the rest) yields 
subproblems that can be treated like the normalization of the normal distribution 
(completing the square and a linear integral transformation). This leads to: 



/oo 
exp(— ax 2 + bx — c |x|)da; 
-oo 



(fr+c) 2 c (b + c\ ic-b) 2 fc — b\ 
e 4 » ertc — = + e 4 » erfc — = 
V2vV \2Va J 



— OO 

_ 1 fit 
~ 2V a 

:= x [e+ erfc (a + ) + e_ erfc (a_)] , (20) 

where erfc(y) := ^= dt denotes the complementary error function. The cdf is 

given by: 

1 f y 

cdf(y) := — -j. / cxp(— ax 2 + bx — c |x|)da; 

* J — oo 
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x_ 



e+erfc (—y/ay + a + ) 
e+erfc (a+) + e_ [erfc (a_) 



, ify < o, 

rfc [y/ay + <*-)] , if y > 0. 



(21) 



Inverting this cdf for a given r 6 [0, 1] is simple. To find y = cdfinv(r) we first check 
if y < by using the cdf for this domain. Let 



z := 



crfcinv 



rjV 



= erfcinv 
= crfcinv 
= erfcinv 



r\ [e+ erfc (a+) + e_ erfc (ce_)] 



erfc (a+) H erfc (a-) 

e + 



erfc (a+) + exp I erfc (a_) 



(22) 



then, y is given by y = — (z — a+)/y/a. If it turns out that this y fulfills y > 0, the 
other half of the cdf has to be inverted. Let 

:=- (Tinny \ — + e+erfc (oq_) + e_crfc (a.— ) 



el 1 



erfcinv < (1 — 



exp 



erfc (a+) + erfc (a_) 



(23) 



Then, y is given by y = (z — a^)/y/a. 

The complementary error function and its inverse are difficult to handle numerically, 
because there are no identities that allow to rescale or shift their evaluation to other 
intervals. Therefore, a robust numerical implementation of formulas (21 1, (22 1 and 
(p3| is rather involved. For the sake of a concise presentation, we present all details 



Appendix B| 



2-4- MCMC Convergence Diagnostics 

Assessing the efficiency of a sampling algorithm for a general purpose rather than a 
specific aim is a difficult task |37| . Two types of convergence diagnostics are usually 
applied: Qualitative diagnostics rely on the visual inspection of some property of 
the chain itj, i — 1,...,K. In contrast, quantitative diagnostics try to compute 
characteristics that can be used to guide the sampling algorithm in an automated 
fashion. This should allow unexperienced users to perform "black box" Bayesian 
inference. Despite a lot of research on theses topics |12[ |4j |46j [51] , no universal method 
is known. For our purpose, a qualitative autocorrelation analysis is appropriate. For 
a test function g : K ra — ► R , the autocorrelation function (acf) R : {0, . . . ,K — 1} — > 
[—1, 1] of the series gi := g(ui), i = 1, . . . ,K is given by: 

1 K ~ T 

r{t) := (k- r ^ i9t ~ A)( - 94+r " A) (24) 
i K i K 

i=i i=i 

(Note that there are other possibilities to define R, but we need R{0) — 1). The value 
of R(t) is referred to as the lag-r autocorrelation w.r.t. g. A fast decrease of the 
acf indicates that consecutive samples get mutually independent quite soon (if the Uj 
would be independent, then, R(t) = 5( T ,o))- For practical considerations, the decrease 
of autocorrelation w.r.t. to the raw number of samples drawn is not decisive if different 
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samplers are compared. A method that has a slower decrease than others might still 
outperform them if it produces new samples considerably faster. In such situations, 
one would subsample the chain to get rid of highly correlated samples and to safe 
memory. Note that the notion of "one" sample is quite arbitrary anyway. In the 
SysGibbs sampler, one speaks of a "new" sample, if all components of u are updated, 
in the MH-Si sampler one speaks of a "new" sample, if one component is updated. 
To address this, we will normally scale the acf by the computation time per sample 
t s : R*(t) := R(t/t s ) for all t — i ■ t s , i £ {0, . . . ,K — 1}, if we compare conceptually 
different sampling methods. R*(t) measures how fast a sampler can produce a certain 
loss in autocorrelation, which is of main interest for practical applications. However, 
while R*(t) is more decisive to compare different samplers, it relies on their concrete 
implementatioijf] 

Normally, the test function g is chosen with respect to the specific aim of inference. 
For instance, one could use the distance to the empirical mean of the whole chain 
if CM estimation is performed, or the projection onto a specific coordinate if that 
coordinate should be marginalized. Then, the rate of autocorrelation decrease is a 
measure of the efficiency of the chain for the specific inference aim. For our general 
purpose, we will test the "worst case". We project onto the direction of the largest 
variance, i.e., the first eigenvector v\ of the covariance matrix C of the posterior: 

g(ui) := (vi,Ui) (26) 

In general, the chain should have most problems to reduce the correlation of subsequent 
samples in this direction v\. For each scenario we examine, the covariance matrix C 
of the posterior is estimated from a long (sub-sampled) chain of the RnGibbs sampler, 
as this sampler will turn out to be the most reliable at a high performance. Note that 
this choice does not give an advantage to the RnGibbs sampler in the autocorrelation 
analysis but rather a disadvantage if the other samplers would have other directions of 
highest variance. We checked that this is not the case in a test scenario we examined 
in preliminary studies. 

Other possible MCMC convergence diagnostic plots that are commonly used are plots 
of \og[p(ui\m)] or of single components (ui)j. Such plots are good to detect possible 
multimodality of the posterior and to determine a sufficient number of burn in steps 
Kq. Multimodality is not an issue in our case, as the posterior is log-concave (the 
energy — log[p(u|m)] is convex). The burn-in length is an important factor for the 
practicability of the algorithms (and we will address this issue in our studies) but it is 
a difficult measure for a fair and definite comparison of the sampling methods. First, 
it crucially relies on the initialization of the chain, so one would have to compare all 
methods for various common initialization strategies, which is not really feasible and 
too application specific. Second, for the Metropolis-Hastings schemes, an adaptation 
of the sampling parameters to is usually carried out in the burn-in phase with the aim 
to optimize the performance of the chain in the real run. We will introduce this topic 
m Section [3T2| The consequence is that Kq also depends on the adaptation scheme, 
which renders the problem of a meaningful comparison even worse. 

| We implemented all samplers in Matlab and optimized them to yield the best possible performance. 
As mentioned in Section |l.2| we originally indented to use the MH samplers in the scenario examined 
in Section |3.1| As their results were unsatisfactory even after a careful optimization of their 
implementation, we decided to develop the Gibbs samplers presented in this paper. 
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3. Results 

In this section, we compare the sampling algorithms for two scenarios: Edge- 
preserving, TV-based image deblurring in ID and impulse prior based image 
deblurring in 2D. All algorithms have been implemented in Matlab and have been 
optimized to the best possible performance. All results have been computed on the 
same CPU architecture limiting Matlab to a single computational thread, i.e., to use 
a single CPU core with 2.80GHz (parallelization is discussed in Section |4p. We paid 
special attention that the computation times are as comparable as possible. 

3.1. Edge-Preserving Bayesian Inversion in ID 

A popular case of Ll-type priors arises from edge-preserving image reconstruction. The 
task is to reconstruct a spatially distributed intensity image that is known to consist 
of piece wise homogeneous parts with sharp edges from indirect, noisy measurements 
(e.g., the recovery of the body's organs and their boundaries from X-ray computed 
tomography data |34lH51l3~T] L Using Gaussian, i.e., L2 -type priors smooths the image 
edges in such situations. In contrast, total variation (TV) priors, which rely on the 
LI norm of the first spatial derivatives, are able to retain them |4"T1 l3"Tl I3"51 15]. The 
use of TV priors in Bayesian inference has led to interesting theoretical questions. 
It was discovered that it is not possible to formulate the conventional TV prior in a 
discretization invariant way |36II35| , i.e., that the posterior converges to a well defined 
limit probability density when the level of discretization is increased while reflecting 
the a priori information of edge-preservation at all levels of discretization. If the TV 
prior is formulated such that it converges, it converges to a Gaussian smoothness 
prior, and, thus, the edge-preservation property is lost. This motivated research on 
whether and how it is possible to formulate edge-preservation as a priori information in 
a consistent, discretization invariant way in the Bayesian framework. Recently, Besov 
space priors have been proposed, which rely on a weighted LI norm of wavelet basis 
coefficients [35, 33 . Such priors are Ll-type priors with an invertible D, thus, posterior 
sampling by means of our Gibbs sampling algorithms can be performed with ease. In 
addition, modifications of the standard TV prior |11| and hierarchical Bayesian models 
|5] |57] [55] |2] have been proposed for discretization invariant edge-preserving image 
reconstruction as well. 

To address the problem of discretization invariance of a prior, one can, e.g., study 
the convergence of the corresponding CM estimate for n — > oo. The problems of 
using MH-based samplers for CM estimation in high dimensions have already been 
noticed in |36l I33j and our research on alternative samplers has been motivated by 
these problems as well. 

3.1.1. Setting We rely on the setting used in [36 . The motivation is to mimic 
a measurement made by a charge coupled device (CCD) used in digital cameras or 
medical imaging devices. These devices integrate the amount of light illuminating 
a certain pixel over a certain period of time. In the continuous model setting, we 
represent the unknown light intensity by a positive function u : [0, 1] — > R, and the k 
pixels of the CCD device as a equidistant division of the subinterval [ , fq^] C [0, 1] , 
i.e., the j-th pixel is represented by the interval [j^, fqi^>]- The measurement at the 
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■iii 



pixel is then given by: 



u(t)dt - 



For convenience, we will choose k 



3 + 1 
k - 2 



2 and L r , 



choose the grid • ■ ■ , ^pf} C [0, 1], and let n — 2 Lu — 1 with L u > L 



(27) 



5. For discretizing u, we 
The 



discretization of the forward mapping implied by ( 27 1 in terms of the k x n matrix 
A can then be implemented by the trapezoidal quadrature rule. The j row of A is 
given by 



,0, ^h, h,h, 



,0], 



(28) 



7 - 2 (t ll --f- m )-i 



where h := -^-j defines the grid size. The discrete TV prior with Neumann boundary 



conditions in our situation is given by: 

(n-l 
i=i 



-i-Ui\J =exp(-A„|£>n|) (29) 

(30) 

where D 6 k(™- 1 ) x ™ j s given by {Du)i :— Ui + \ — m, i = 1, . . . , n — 1. We indexed A„ 
by n to stress that we may choose it depending on the discretization level. 



For the Gibbs sampler (cf. Sections 2.1 and 2.3 ), we note that I = rank(Z?) = n— 1, and 
v\, . . . , v n -i G R n are given by step functions: (vi)j — l/^n. These are completed to 
a basis of W 1 by {v n )j — 1 Vj. If we reorder them and define V := [v n , v\, . . . , v n -i], 
we can write 1/ as 



if « > j 
else 



(31) 



The unknown function u we actually use is the indicator function on [|, |], see Figure 



|l(a)| Measurement data m is generated using formula (27 1, see Figure 1(b) The 
standard deviation of the measurement noise a is 0.001. We will examine different 




(a) (b) 
Figure 1. Left: The unknown function u(t). Right: The measurement data m. 



combinations of n and A„ : 

A: n = 63 in combination with A„ = 100, 200 and 400, respectively. Here, we focus 
on increasing the impact of the prior. The posterior will become less Gaussian 
because the weight of the Ll-type TV prior is increased. 
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B: n = 2 Lu - 1 for L u = 7, 8, . . . with A„ = 25 • \fn + 1. With this scaling of A n , 
the posterior p(u\m) converges for n — > oo, but the edge-preserving property of 
the TV prior is lost, see for details. The CM estimate ucm converges to a 
smooth limit function, which will facilitate the visual validation of the results of 
the different MCMC methods. 



3.1.2. Preliminaries 



Choice of Parameters: For the MH schemes, the proper tuning of k is essential. If it 
is very small, the proposals will always be accepted since the distribution to sample 
from is continuous. However, in return the exploration of the sampling space is slow. 
On the contrary, if is too large, the differences in probability will be huge because 
the distribution is log-concave and new proposals will hardly be accepted. A good 
overview on this topic is given in |45[ l4"2"] . The remarkable result is that in high 
dimensions, having a total acceptance rate of new proposals of about 0.234 leads to an 
optimal efficiency independent of the distribution to sample from. Furthermore, this 
optimal efficiency hardly drops in the range between 0.1 and 0.4 of acceptance rate. 
This yields an easy to implement rule to tune k: One could find the optimal k in a 
preliminary MH-MCMC run and initialize the real MH-MCMC run with it. However, 
it turns out that this k is only optimal once the chain has reached the main support 
of the distribution while it can hinder the chain from ever getting there (the burn-in 
length increases dramatically, see Section 2.4 1. For these reasons, on-line adaptation 
of k is usually used. The empirical acceptance rate is monitored, and n is increased 
if it is too high while k is decreased if it is too low. The scheme we use is that every 
10 000 samples, the empirical acceptance rate is computed and if it is above 0.35, k is 
multiplied by 1.2 while it is multiplied by 0.8 if it is below 0.15. In theory, the resulting 
chain will then not be a Markov chain anymore (but it is still ergodic). However, in 
practice, using this scheme, k hardly ever changes once the burn-in time is over and 
so the real chain is not affected. 

For MH-Ncom, we have to choose n* , i.e., the number of components that are updated 
in one step. We choose n* — [n 7 ^ 12 J , which roughly corresponds to the values used in 



Burn-in Times: As noted in Section |2.4[ the sufficient amount of burn-in steps that 
have to be drawn can be deduced from observing log[p(ui|m)]. Once it oscillates 
around a constant value, the stationary part of the distribution is reached. Averaging 
log[p(uj|m)] over a large number of independent chains that all started at the same 
initialization {u = in our case) removes the oscillations and allows to determine K$ 
in an easy fashion. See Figure [2] for an example of such a plot. In Table [T] the burn- in 
steps Kq and the corresponding computation times to are listed for the combinations 
of n and A that are examined in detail. It gives a first impression of how the methods 
scale with n and X n , but as noted in Section 2.4 it does not allow for a fair and detailed 
comparison. 



3.1.3. General Autocorrelation Analysis As explained in Section [2.4[ we will rely on 
autocorrelation plots for the projection of the samples onto the direction of maximal 
covariance as qualitative measures of the efficiency of the sampling algorithms. Figure 
[3] shows the autocorrelation plots R(t) for n — 63 and varying A. In Figure U the 
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Figure 2. Plots of log[p(ui\m)] for n = 1023, A = 800, using the RnGibbs 
sampler. Red «, *, -)-: three independent realizations. Blue X: the average of 
5000 independent realizations. 



Table 1. Necessary burn-in steps Kg and computation time to iu seconds 
for each method and combination of n and A when starting from u = 0. The 
values were found by observing the log posterior probability averaged over a large 
number of independent chains. Listed as (Ko,to) 

Model parameters (n,A) 

Method 

(63,100) (63,200) (63,400) 



MH-Iso 

MH-Ncom 

MH-Si 

RnGibbs 

RnGibbs03 

RnGibbs07 

SysGibbs 

SysGibbs03 

SysGibbsQ7 



(4e5,1.8el) 
(4e5,2.3el) 
(5e5,2.8el) 
(200,0.5e0) 
(200,0.9e0) 
(200,1.0e0) 
(400,1.0e0) 
(200,0.9e0) 
(200,0.9e0) 



(4e5,1.9el) 
(4e5,2.5el) 
(5e5,3.0el) 
(200,0. 5e0) 
(200,1.0e0) 
(200,1.0e0) 
(500,1. 3e0) 
(500,2. 2e0) 
(500,2. 3e0) 



(5e5,2.3el) 
(5e5,2.9el) 
(6e5,3.4el) 
(200,0.4e0) 
(200,0. 9e0) 
(200,0. 9e0) 
(500,1. 3e0) 
(500,2. 2e0) 
(500,2. 2e0) 



Model parameters (n,A) 

Method 





(127,280) 


(255,400) 


(511,560) 


(1023,800) 


MH-Iso 


(7e5,3.4el) 


(4e6,2.1e2) 


(3e7,1.9e3) 


(2e8,1.7e4) 


MH-Ncom 


(7e5,4.2el) 


(4e6,2.6e2) 


(3e7,2.5e3) 


(2e8,2.3e4 ) 


MH-Si 


(8e5,4.4el) 


(4e6,2.3e2) 


(3e7,1.9e3) 


(2e8,1.5e4) 


RnGibbs 


(80,0.4e0) 


(50,0.4e0) 


(30,0.5e0) 


(20,0.6e0) 


RnGibbs03 


(80,0.7e0) 


(50,0.9e0) 


(30,l.le0) 


(20,1.4e0) 


RnGibbs07 


(80,0. 7e0) 


(50,0.9e0) 


(30,l.le0) 


(20,1. 5e0) 


SysGibbs 


(150,0.7e0) 


(100,0. 9e0) 


(150,2. 9e0) 


(150,5.3e0) 


SysGibbs03 


(150,1.4e0) 


(150,2. 6e0) 


(150,5. 2e0) 


(150,1.0el) 


SysGibbsQ7 


(150,1.4e0) 


(150,2. 7e0) 


(200,6. 8e0) 


(200,1.4el) 



corresponding temporal autocorrelation plots R*(t) are shown. The comparison is 
split up into MH-based vs. normal Gibbs samplers and normal vs. overrelaxed 
Gibbs samplers to reduce the number of plots shown in one figure. The plots for 
RnGibbs03, SysGibbs03 and MH-Ncom were omitted for the same reason: The plots 
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for RnGibbs03 and SysGibbs03 lie between the plots of RnGibbs and RnGibbs07 
and SysGibbs and SysGibbs07, respectively. The plots of MH-Ncomp look similar 
to the ones of MH-Iso and lie between MH-Iso and MH-Si. Figures [5] and |6]show 
the autocorrelation plots R{r) for varying n and A„ = 25 • \/n + 1. In Figure M the 
temporal autocorrelation plots R*(t) corresponding to Figure [5] are shown. The plots 
for RnGibbs03, SysGibbs03 and MH-Ncom are, again, omitted. In addition, the plots 
for Z = 10 are not shown as the trends of the autocorrelation functions for growing n 
are already clearly visible. 

Table [2] lists the lag to.oi for which the autocorrelation R(t) drops below 1 % for the 
first time and the corresponding computation time to.oi = to.oi ■ ts- 




2g4 4e4 T 6e4 8e4 10e4 200 400 % 600 800 1000 



(a) MH-based samplers (b) Gibbs-based samplers 

Figure 3. Autocorrelation plots R(t) for n = 63, X n = 100, 200 and 400. 



3.1-4- Visual Results To get a visual impression of the sampling results, CM 
estimates are computed using the different samplers at different computation times 
for n — 1023, A n = 25 • \Jn + 1. The computation times examined are t* = 1 s, 10 
s, 1 minute, 1 hour and 1 day, respectively. Practically, a long chain with Kq = 
was generated and sub-chains corresponding to all the samples drawn before t* were 
extracted. Then, CM estimates were computed from the sub-chains by discarding 
Kq = mm(Ko, 0.5 -K*) burn-in samples, where Kq are the burn-in steps listed in Table 
[T] and K* denotes the number of samples in the subchain. The results are shown in 
Figure [8j We have to emphasize that we did not chose to show the CM estimate 
for the TV prior because the reconstruction is convincing. In fact, as explained in 
Section |3.1| they are extremly smooth compared to the corresponding MAP estimates 
and, thus, bad reconstructions of the discontinuous u . However, this smoothness is 
very useful for gaining a visual impression of the convergence and the properties of 
the different sampling schemes: The CM estimate computed from the chain converged 
once it is smooth. To demonstrate the capabilities of the newly developed Gibbs 
samplers for the practical use, we also examine the theoretical questions addressed in 
|36| . For the choice of A„ oc \Jn + 1, the TV prior converges to a smoothness prior. 
To support this finding with numerical simulations the CM estimate was computed 
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Figure 4. Temporal autocorrelation plots R*(t) for n = 63, A n = 100, 200 and 
400. 




Figure 5. Autocorrelation plots R(t) for varying n and A for MH and un- 
overrelaxed Gibbs sampler. Note that the r axis starts at r = 1 and is scaled 
logarithmically. 



for n = 63, 255, 1023, 4095 in (36] using the MH-Ncom sampler. Although the whole 
computation took about a month of time on a desktop PC equipped with a 2.8 GHz 
single core CPU, the authors admitted that the results were only partly satisfying. In 
Figure[9] we show the CM estimate computed for n = 63, 255, 1023, 4095, 16383, 65535 
using the RnGibbs sampler on a comparable CPU. Again, the CM estimates are only 
shown because the increasing smoothness of the CM estimates for growing n allows 
for the visual inspection of the chain convergence. 





Figure 7. Temporal autocorrelation plots R*(t) for varying n and A for MH 
un-overrelaxed Gibbs sampler. Note that the t axis starts at the smallest t s and 
is scaled logarithmically. 



3.1.5. Normal vs. ORR Gibbs Sampling Using oriented overrelaxation removes a 
certain amount of randomness from the generated chains. This may lead to a faster 
exploration of the posterior distribution, but can also enhance non-ergodic tendencies 
of the sampling approaches. In Figure [10] more detailed autocorrelation plots for the 
ID scenario using n = 1 023, A = 800 are shown. Whether oriented overrelaxation 
is practically advantageous relies on the computational cost of sampling from the 
single component density ( 17 1 costs compared to the other computation steps. Using 
oriented overrelaxation to sample from (17 1 takes roughly twice as much computation 



Fast Sampling for LI Problems 



19 



Table 2. Lag of 1% auto correlation for each method and combination of n and 
A in terms of (samples, computation time) 



Method 


Model parameters 


(n,A) 








(63,100) 


(63,200) 


(63,400) 


MH-Iso 


(4.1e4,2.1e0) 


(1.2e5,6.2e0) 


(2.1e5,l.lel) 


MH-Ncom 


(4.2e4,2.6e0) 


(1.0e5,6.4e0) 


(2.1e5 ,1.0el) 


MH-Si 


(4.7e3,0.3e0) 


(7998,0.5e0) 


(8.8e4,5.4e0) 


RnGibbs 


(1685,4.1e0) 


(1402,3.3e0) 


(561,1.2e0) 


RnGibbs03 


(1239,5. 8e0) 


(983,4.6e0) 


(395,1.8e0) 


RnGibbs07 


(1056,5. 5e0) 


(811,3.7e0) 


(318,1.4e0) 


SysGibbs 


(985,2.4e0) 


(810,1.9e0) 


(242,0.5e0) 


SysGibbs03 


(412,1.9e0) 


(242,l.le0) 


(76,0.3e0) 


SysGibbsQ7 


(137,0. 7e0) 


(97,0.4e0) 


(31,0.1e0) 



Model parameters (n,A) 

Method 





(127,280) 


(255,400) 


(511,560) 


(1023,800) 


MH-Iso 


(l.le6,5.0el) 


(4.6e6,2.5e2) 


(2.7e7,1.9e3) 


(1.3e8,1.3e4) 


MH-Ncom 


(9.4e5,5.4el) 


(3.2e6,2.2e2) 


(2.1e7,1.8e3) 


(1.8e8,2.3e4) 


MH-Si 


(4.8e4,3.1e0) 


(2.1e6,1.2e2) 


(3.1e7,2.1e3) 


(2.9e8,2.5e4) 


RnGibbs 


(2017,9. 2e0) 


(1014,8.7e0) 


(46,0.8e0) 


(39,1.3e0) 


RnGibbs03 


(1006,8. 9e0) 


(1052,2.0el) 


(31,l.le0) 


(29,2. Ie0) 


RnGibbs07 


(953,8. 7e0) 


(473,8.7e0) 


(28,0.9e0) 


(24,1.8e0) 


SysGibbs 


(770,3.4e0) 


(270,2.3e0) 


(9,0.1e0) 


(12,0.4e0) 


SysGibbs03 


(230,2. 0e0) 


(165,2.9e0) 


(8,0.3e0) 


(7,0.5e0) 


SysGibbsQ7 


(126,1. 2e0) 


(153,2.8e0) 


(7,0.2e0) 


(6,0.4e0) 



time as not using it, almost independent from No (when using Algorithm [4]). If 
the other computation steps in the whole sampling scheme take way more time (i.e., 



the computation of b, see Section 2.3 1, this extra computational cost is negligible. 
In our scenario, the ratio between the total computation time per sample t s for the 
RnGibbs07 and the RnGibbs sampler varies considerably. It drops from 1.93 for 



the fast but memory consuming implementation (see Section 2.3 ) to compute b and 



n = 63 to 1.02 for the slower implementation to compute b and n = 65 535. 
3.2. Image Deblurring with Impulse Prior in 2D 

3.2.1. Setting As a second example, we consider 2D image deblurring with a simple 
LI prior, i.e., D = I n (also called impulse prior). The unknown intensity function 



[0, 1] x [0, 1] — > M is shown in Figure 11(b) It consists of a couple of circular 



spots of constant intensity whose radii and intensities slightly vary between single 
spots. The forward mapping is given by a convolution with a Gaussian kernel 
with standard deviation of 0.015. Measurement data is generated by integrating the 
resulting convoluted image over 513 x 513 regular pixel and adding noise. The relative 
noise level is 0.1, i.e., the standard deviation a of the measurement noise is 0.1 times 
the maximal intensity of the noiseless signal. The resulting measurement data is 



shown in Figure 11(c) The image will be reconstructed on the same pixel grid used 
for the measurement using Neumann boundary conditions, thus, the dimension of 
the unknowns n is 511 2 = 261 121. To avoid an inverse crime, the grid used for the 
generation of the measurement data was 4 times finer. For the MH samplers, the 
same on-line adaptation of k was used as explained in Section |3.1.2| The update 
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Figure 8. Visual impressions of the CM estimate obtained by a sampler after 
a certain computation time (n = 1023). Computation times, which are framed 
by exclamation marks, indicate that the burn-in time listed in Table [l] was not 
reached yet. For the Gibbs samplers, a zoom into the interval [0.495, 0.505] is 
added. 



intervals and the up- and down-scaling factors have been chosen carefully to optimize 
the performance of the samplers while keeping the adaptation stable, i.e., monotonic. 
The forward mapping is implemented using ffts. For the Gibbs sampler, we note that 
V = I n and that (19 1 simplifies to 

($*M]Kh] = ^ [(A* A) (32) 
which can be implemented in a efficient, direct way. 




Figure 9. CM estimates for growing n and A n <x \/n + 1. Sampling was 
performed with the RnGibbs sampler. 



DC 



o.: 




(a) RandGibbsO* (b) SysGibbsO* 



Figure 10. Autocorrelation plots for random and systematic scan Gibbs sampling 
using different values for oriented overrelaxation parameter No- Red ♦ : Ol (i.e., 
no overrelaxation at all); green A : 03; orange T : 07; pink ■ : 013; blue • : 
021. 



3.2.2. Visual Results The practical procedure to compute visual results is identical 
to the one used in Section |3.1.4| In Figures [T2fr4| the CM estimates computed after 
1, 5 and 20 hours are shown. The results of MH-Ncomp. RnGibbs03 and SysGibbs03 
are omitted here. Choosing a good scaling to compare the results for a single sampling 
method is not easy because of outliers in the lh image. These outliers would either 
lower the contrast if a simple linear min-max scaling based on all images is chosen or 
would lead to the impression that constant regions are growing if an individual scaling 
for each image is used. The scaling we used is generated in the following way: For 
each method, we merged and sorted the pixel values of all three CM estimates. From 
this sorted set, the smallest and largest values are discarded, using 0.1% and 99.9% 
as thresholds, respectively. A linear min-max scaling is generated from the remaining 
values, and the discarded values are mapped to the beginning and end to this color 
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(a) 



(b) 



(c) 



Figure 11. Left: The colour scale used in all 2D images. Middle: The unknown 
function u. Right: The measurement data m. 



scale, respectively. 

An examination of log[p(uj|m)] was again used to determine the burn-in steps Kq. 
For the Gibbs samplers, the burn-in steps were between 12 and 30 and in all images 
shown, the burn-in phase was already completed. For the MH samplers, the plots of 
log[p(iti|m)] suggested that even after 20 hours of computation, the chain was still 
far away from the central parts of the posterior, which is also evident from the CM 
estimates. 



4. Discussion and Conclusions 



4-1- MH-Samplers 

For the specific scenario we examined, the efficiency of the basic MH samplers decreases 
when either the influence of the Ll-type prior increases (i.e., A is increased) or the 
dimension of the unknowns, n, is increased. Figures [3(a)] [4(a)] [5] and [FJ and Tables [l] 
and[2]clearly document this. The largest number of unknowns examined was n — 1 023, 
which is still moderate for typical inverse problem scenarios. However, for this number 
of unknowns, both the burn-in time and the time to decrease the autocorrelation of 
a new sample below 1% are in the order of a few hours. In Figures 8(a) - 8(c) this 



is visualized by the slow convergence of computed CM estimates in the ID scenario. 
A (visually) satisfactory result is only obtained after 1 day of computation time. 
In the 2D example, no satisfactory result could be obtained, even after 20 hours of 
computation time (see Figure 12 1. The examination of log[p(itj|m)] suggested that the 



computation required to obtain such a result is of orders larger. In total, our detailed 
studies support the empirical findings of former applications of basic MH-samplers to 
Ll-type priors, see, e.g., [551155] . 
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(a) MH-Iso, lh (b) MH-Iso, 5h (c) MH-Iso, 20h 




(d) MH-Si, lh (e) MH-Si, 5h (f) MH-Si, 20h 



Figure 12. Visual results for image deblurring with an impulse prior in 2D, part 
1. In the upper left corner of each subfigure, a zoom into the marked area in the 
original figure is shown. The scaling used in these images is explained in the text. 



4-. 2. Gibbs- Samplers 

We again stress that the Gibbs samplers we proposed and examined have to be 



considered as very basic variants of Gibbs sampling (cf. Section 2.2). This makes 
it even more surprising that for these samplers show totally different trends compared 
to the MH-samplcrs. 

Random Scan Gibbs Samplers: For RnGibbs, RnGibbs03 and RnGibbs07, the 
required burn-in steps stay constant when increasing A and clearly decrease when 
increasing n, cf. Table [l] Even as the computational costs of drawing a new sample 
increases with n, this effect keeps the computational time to draw the required number 
of burn-in steps almost constant. Figures [3(b)[ |4(b)| and Table [2] show that the decay 
of R(t) and R* (t) is even faster for increasing A. From Figures |5j [6] and Tabl e [2] we 
see that for increasing n, this is also true for R(t). For R*(t), we see in Figure |7| that 
for large n the temporal decay cannot further decrease. This is a normal saturation 
effect because the the autocorrelation decrease is bounded, ft would even occur for 
an i.i.d. series of n dimensional random variables, if the computation time would 
increase with n. The visual results (see Figures [8(d)1 - [8(f)] ppl] ) clearly support 



these findings. Especially the short burn-in times are noticeable. In both scenarios, 
the CM estimate using the shortest computation time already represents the most 
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(d) RnGibbsQ7, lh 



(e) RnGibbsQ7, 5h 



(f) RnGibbsQ7, 20h 



Figure 13. Visual results for image deblurring with an impulse prior in 2D, part 
2. In the upper left corner of each subfigure, a zoom into the marked area in the 
original figure is shown. The scaling used in these images is explained in the text. 



important features of the final solution. Using oriented overrelaxation in combination 
with random scan Gibbs sampling does not seem to lead to any problems concerning 
the ergodicity of the chain. The autocorrelation plots in Figures |3(b)| [6| |10(a)| are 
still monotonic and positive. Clearly, the decay of R(t) is faster using overrelaxation. 
Concerning R*(t), overrelaxation is only effective, if the additional computational cost 
is negligible compared to other parts of the sampling process. 



Systematic Scan Gibbs Samplers: For SysGibbs, SysGibbs03 and SysGibbs07, the 
results are less clear. At first glance, the trends in their results seem to be rather 
similar to the random scan samplers and within a direct comparison, they often seem 
to outperform them, see, e.g., Table [2] However, Figures 3(b)| 6 and |1 0(b)] show that 
for growing n and No, the plots of R(t) start to oscillate and are clearly negative in 
some areas. It seems that in combination with the TV prior, the subsequent update 
of neighboring increments leads to non-ergodic tendencies in the sampling procedure. 
These tendencies are amplified when using oriented overrelaxation. Producing anti- 
correlated samples may in fact advantageous for certain tasks [37]. However, we would, 
in general, not advise to use systematic Gibbs sampling. The additional computational 
cost of drawing a random component to update in each step is small. In contrast, a 
sampler that relies on a non-ergodic mapping may produce unpredictable results for 
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(a) SysGibbs, lh 



(b) SysGibbs, 5h 



(c) SysGibbs, 20h 




(d) SysGibbs07, lh 



(e) SysGibbs07, 5h 



(f) SysGibbs07, 20h 



Figure 14. Visual results for image deblurring with an impulse prior in 2D, part 
3. In the upper left corner of each subfigure, a zoom into the marked area in the 
original figure is shown. The scaling used in these images is explained in the text. 



certain tasks. 
4-3. General 

There are multiple reasons for the loss of performance of the basic MH samplers 
compared to the basic Gibbs samplers in the specific scenarios we examined. The 
crucial part for an MH sampler is the design of a good proposal distribution. As 
explained in Section |2.2[ the basic MH samplers we applied are "black-box sampler" 
algorithms. In the design of their proposal distributions, no specific information 
about the posterior was taken into account. In return, they usually exhibit very 
fast computation times. The standard proposal distributions we used are designed to 
sample from low dimensional, Gaussian-like distributions. However, high dimensional 
posteriors from sparsity promoting priors have very different properties. Standard 
MH-samplers have to take very small steps to obtain a good acceptance rate. This 
leads to long burn-in times and a slow decrease in autocorrelation. The situation 



is similar with optimization algorithms used for MAP estimation (cf., Section 2.1). 
Black-box optimization algorithms that only rely on evaluating the objective function 
(i.e, log[p(u|m)]) are usually too slow when applied to specific, high dimensional 
posteriors. The basic Gibbs samplers we proposed incorporate more posterior-specific 
information into the sampling procedure at the costs of a larger computation time. The 
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conditional single component densities, which can be regarded as optimal transition 
kernels, are computed and sampled from explicitly. This small extra amount of 
incorporating problem specific information already seems to be sufficient to generate 
very promising sampling procedures for high dimensional Bayesian inversion using Ll- 
type priors (the dimensions of the unknowns used in Figure [9] and the 2D scenario 
are far beyond any previously reported use of MCMC for Ll-type inverse problems). 
In general, both sampling techniques have advantages and disadvantages and will 
outperform the other given a specific scenario. 

For the very reason that we only used very basic Gibbs samplers, our results also 
challenge common beliefs about the feasibility of MCMC sampling in high dimensional 
inverse problems in general. We showed that MCMC schemes are not in general slow 
and scale bad with increasing dimension. We rather think that MCMC schemes for 
inverse problems are far less elaborate compared to optimization schemes up to now. 
With regard to the corresponding optimization algorithms for MAP estimates, one 
possible reason for the superior performance of the single component Gibbs samplers 



might be the transformation of the posterior into the basis v\, . . . , v n (cf., 2.3 1. In this 
basis, the prior diagonalizes. It can be shown that the MAP estimate is sparse in this 
basis, i.e., many basis coefficients are exactly zero. Many optimization algorithms 
to compute the MAP estimate take advantage of this and perform better when 
transformed into that basis. One could argue that this might be the case for the 
sampling procedures as well, and that a fair comparison between MH and Gibbs 
samplers would need to transform the MH samplers into the basis V\, . . . ,v n as well. 
However, there are reasons why this argument is not valid. The striking advantage 
of MH samplers is their simple, "black-box"-like implementation. In practice, they 
are normally implemented in the most direct way and we stuck to that paradigm. 
In addition, in the 2D case, = e,, i.e., both MH and Gibbs samplers are already 
formulated in the right basis. However, this does not affect the bad performance of the 
MH samplers compared to the Gibbs samplers. But most importantly, while the MAP 
estimate is sparse, the CM estimate is not, and single samples from the posterior are 
not sparse as well. Theoretically, it has been shown in [35] that for denoising using a 
TV prior, the CM estimate is, in fact, never sparse. One can see this, e.g., in Figure 
[8] The estimates are neither sparse in the normal basis, nor in the increment basis. 
In Figures [12} [l4| one can clearly see that this is similar for the normal LI prior. 



4-. 4- Outlook and Extensions 

In this first study, we only compared very basic variants of MH and Gibbs sampling. 
In the future, a comparison to more sophisticated variants of MH schemes such as 
delayed rejection [40 23 , adaptive Metropolis schemes [24, 23J or the t-walk [9J has to 
be undertaken. The most promising technique for our scenario might be to combine a 
tailored variant of delayed rejection with SCAM [53] . Many of the sophisticated MH 
variants have been developed for the study of non-linear, computationally extensive 
and high dimensional inverse problems (see |13| for a recent overview), i.e., situations 
where MH sampling is the only MCMC technique that can be applied. Improving 
the basic Gibbs sampling schemes used here by adding adaptive elements or optimal 
directions is far less developed until now [TD] and is an interesting future topic of 
research. A comparison between sophisticated (possibly adaptive, i.e., non-markovian) 
variants of MH and Gibbs sampling will also need a concrete application scenario since 
a general comparison by the measures used in this article is less meaningful. 
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We assumed that the variance a 1 of the noise term (cf. Section 2.1 1 is known exactly 
(or a good estimate is available), which is not always the case in practical applications. 
The Bayesian framework can account for the uncertainty of this model parameter as 
well: a 1 is treated like the other unknowns u (but assumed to be independent from 
them) and the available information about its typical values are expressed by a prior 
Ppr{o~ 2 )- An advantageous choice for p pr {o~ 2 ) is given by the conjugate prior w.r.t. 
pu(m\u, a 2 ), which is the inverse gamma distribution |19| : 

p(x\a ) 0) = -^x- a - 1 exp (-?-), (33) 



T(a) 

with shape and scale parameters a and f3 (T denotes the gamma function). Now the 
joint posterior for u and a 2 given the data m reads: 

p P ost(u, a 2 \m) cx 
k 

( AS] 2 cxp (_ _L| )m _ Au g _ X \ Du \) exp (_( a + 1) log^) _ A 



\2a*J l \ 2a 2 " " 2 1 "7 ^ V a 2 

exp ( W\ m - A u\\l + P _ X \Du\-(a + l + k/2 ) \og(a 2 )\ i ) 



A comparison with ( 33 1 shows that the conjugacy property of the prior has the effect 
that the posterior of a 2 conditioned on both m and u is, again, an inverse gamma 
distribution with shape and scale parameters a and j3 given by : 

& = a + k/2; p = |||m - Au\\l + /3 (35) 

This allows us to perform Bayesian inference for the joint posterior using Gibbs 
sampling: For sampling along a component of u we can use the fast samplers presented 
her^f] and for sampling over a 2 conditioned on all other parameters we can use 
standard implementations of gamma samplers. Such a Gibbs sampler can, e.g., be 
used to infer a joint CM estimate (ujCMi a jCM)- ^ ne information given by o~ 2 CM can 
be used to evaluate or improve the measurement setup or to inform other Bayesian 
reconstructions. The estimate UjCM compared to an estimate assuming a single, 
constant a 2 contains the marginalized uncertainty about a 2 and may yield a more 
robust estimate of u in practical applications. In principle, it is possible to marginalize 
over a 2 explicitly (the computation is similar to |19|): 

P P ost(u\m) = J p post (u, a 2 \m)da 2 cx (|||m- Au\\\ + 0) {a+k/2) Ppr ( u ) (36) 
i(u) a ^ (tf+1 ' 



oc IH —J Ppr{u), (37) 

where „ = 2a + 1 + k; t(u) = ( J^^ ) • (38) 

This is a (one sided) Student's t-distribution for t(u) with v degrees of freedom. 
However, working with this distribution directly is more difficult since it is not log- 
concave (this problem gets worse if an individual variance af for each noise channel is 
assumed). Working with the full joint posterior instead can circumvent some of these 
problems. 

§ The practical implementation has to be slightly adopted in the sense that the varying a 2 has to 
be removed from all expressions that are precomputed and added back to them at runtime. 
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The Gibbs samplers presented here have to be generalized to work with arbitrary D, 
e.g., to deal with anisotropic total variation priors in arbitrary dimensions and with 
arbitrary boundary conditions. In addition, an extension to block sparse priors, which 
rely on mixed L2-L1 norms, would be advantageous to, e.g., address isotropic total 
variation priors. 

Parallclization of MCMC sampling is easily implemented. In the most basic form, 
./V independent chains are generated, each on one CPU. However, the efficiency of 
this approach is strongly limited by the burn-in and mixing time [37 . If all chains 
are initialized at the same state, parallelization is only efficient, if the chains become 
independent very fast. Our results suggest that parallelization of Gibbs samplers will 
be way more efficient than of MH samplers. 

The Gibbs sampling algorithms developed by us are fast enough to tackle sampling for 
Bayesian inversion techniques in real applications, which will be an important topic 
of future work. In many applications like, e.g., limited angle CT, exploring the full 
range of Bayesian inversion by also incorporating sample based analysis was, up to 
now, rather regarded as a theoretical option, see, e.g., [491 134] . 

In addition, theoretical questions concerning sparse Bayesian inversion, like, e.g., the 
ones addressed in [3T>] f3"5] can be also be addressed numerically (cf. Section 3.1.4|. 
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Appendix A. Code 

On the authors homepage[}[] Matlab code supporting this publication is provided. It 
contains scripts to create the scenarios examined in the numerical studies as well as 
implementations of all the Gibbs sampling algorithms presented here. One should, 
however, mention that Matlab is not very well suited for these implementation as 
the sampling algorithms consist of very sequential but rather basic procedures. We 
therefore also provide alternative implementations of the samplers using Fortran within 
.mex-files, which can be compared to the corresponding .m-files. The speed-up factor 
for the RnGibbs Sampler ranges from 43 to 18 using n — 63 or n = 1023 for the ID 
scenario and is 1.35 for the 2D scenario with n = 261 121 (the computation of b is 
by far the most expensive computational task, and little gain can be expected from a 
direct implementation compared to Matlab). 



Appendix B. Implementation 



In this section, we give details on how to implement formulas (21 1, (22) and (23) 



The complementary error function and its inverse are difficult to handle numerically 
because there are no identities that allow to rescale or shift their evaluation to other 
intervals. For the applications we address, problems due to limited precision occur 
when formulas (21 1, (22 1 and ( [23] ) are implemented directly (formula (21 1 is only 
required for applying ordered overrelaxation) . Dependent on the signs of a+ and 
a_. we use different alternative formulas that allow for a stable numerical evaluation. 



Currently: http://wwwmath.uni-muenster.de/num/burger/organization/lucka 
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Additionally, we express erfc(x) in terms of the scaled complementary error function 
erfcx(a;) = exp(x 2 )erfc(x). which decays less fast for x — > +00. We only list the results 
here (the corresponding transformations are elementary but lengthy to write down). 
Because c 0, not both a + and a_ can be negative, which leaves three different cases 
to examine: 

a + > 0, a_ > 0: Let 7 ++ := [erfcx + erfcx («_)]. Then, the parts of (21 1 are 
given by: 



y < : exp (—ay 2 + 2y/aya+) erkx(—^/ay + a + )/"f ++ 
y > : 1 — cxp (—ay 2 — 2\fayaJ) ericx(Vay + a-)/j ++ 



(B.l) 
(B.2) 



The arguments of erfcinv in (22 1 and (23) are given by: 
In (22): rexp(-a^) 7++ 



In (231 



(1 - r)cxp(-a 2 _)j ++ 



a + < 0, a_ > 0: Since erfcx increases very fast for x — > 

erfcx(— x) — 2 exp(cc 2 ) — erfcx (a;). Let 7 |_ := [erfcx (- 

formulas to implement the parts of (21 1 are given by: 



(B.3) 
(B.4) 

00 one has to use the identity 
aq_) — erfcx (a_)]. Then, the 





y<0 


\fay 


- a + > 




y < 


Vay H 


- a + < 




y>0 


Vay- 


> 




y>0 


y/ay- 


\- a- < 



exp 



-(y/ay-a+) erfcx(- v / ay + a+) 



2 - cxp 



2 - cxp (-a\) 7_ + 

- (Vay - a + ) 2 erfcx( % /ay - 



exp 
1 

2 — exp 



2 - cxp (-ct\) 7_+ 
-{V^y -a-) 2 ericx(Vay + a_) 
2 exp (^) - exp (-a£) 7-+ 

(v^y^a-) 2 erfcx^v^y- a_) 
2 exp (^) - cxp (-a 2 ) 7-+ 



The arguments of erfcinv in ( 22 1 and (|23j) are given by: 
[2 - exp(-a 2 + ) 7 _ + ] 

- exp(— a 2 _)7 



In (22| 



In (231: (1 



2 exp 



(B.5) 
(B.6) 
(B.7) 
(B.8) 

(B.9) 
(B.10) 



a + > 0, a_ < 0: Let 7_| := [erfcx (a + ) — erfcx (— a_)]. Then, the parts of (21) are 

given by: 



y<0 



y>0 
V~ay + ex.- > 

y>0 
Vay + a- < 



exp 



(Vay - a + ) erfcx(- Vay + a + ) 



2exp(-^) + exp (-c^) 7+ _ 



1 - 



exp 



+ erfcx(- v /ay + a_) 



exp 



2 + cxp (-a 2 -) 7+- 
— (y/ay + a_) 2 erfcx(— V a V ~ a -) 



exp(-aij7_ f 



(B.ll) 

(B.12) 

(B.13) 
(B.14) 
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The arguments of erfcinv in (22 1 and (23) are given by: 
-be 



In (22| 



2 exp 



+ exp(-a + ) 7+ _ 



In p3l> : (1-r) [2 + cxp(-a^) 7+ _] 



(B.15) 
(B.16) 



Using the above expressions directly can still lead to stability issues, because very large 
numbers are often multiplied with very small numbers. It is preferable to compute 
the logarithms of the expressions first. For this, let x > 0, (x + y) > then: 

log(x + y) = log(» + log(l + sign(y) exp(log(|y|) - log(x)) (B.17) 



Using this identity we can compute the logarithms of expressions (B.l I - (B.16). We 
note that sign [±erfcx(-)] = ±1. 

(B.18) 
(B.19) 
(B.20) 
(B.21) 



log [ pH] )] = (-ay 2 + 2^Jaya + ) + log [erfcx(-^y + a+)] - log ( 7++ ) 

log [i - mm] - ' — 



log [erfcx( v / ay + a_)] - log ( 7++ ) 



-ay" — 2y/aya. 
log [Q] = log(r) - a* + log( 7++ ) 
log [ flR4| ] = log(l - r) - a 2 , + log( 7++ ) 
log [( [R5| ] = - (-\/ay + a+) 2 + log [erfcx(-^ + a+)] - log(2) 
- log {1 - sign( 7 _+) exp [-a\ + log(| 7 _+|) - log(2)] } 



(B.22) 



log [( |g.5[ )] = log (l - exp jlog [erfcx(y/ay - a+)] - log (2) - (^/ay - a+) 2 }) 

- log {1 - sign( 7 _+) exp [-a\ + log(| 7 _+|) - log(2)] } (B.23) 



log [1 - flgTffr ] = - (Vay + 2a_) + log [erfcx(v^2/ + a-)] - log(2) 



be 



log < 1 - sign( 7 „ + ) exp 



log(| 7 - + |)-log(2)-- 



(B.24) 



lo 



g [1 - (Q] - log (l - exp {log [erfcx(-V^ - a_)] - log (2) - (^y + a_) 2 }) 



6c 



log <M - sign ( 7 _ + ) exp 



-c£+log(| 7 _ + |)-log(2) 



be 



log = log(r) + log(2) 

+ log {1 - sign( 7 _ + ) exp [-a 2 + + log(| 7 _ + |) - log(2)] } 

log [mm] = log(l - r) + log(2) + - 



-a?.+log(|7_ + |)-log(2)- 



log < 1 - sign( 7 _+) exp 



6c 



log [(£.11)] 



/ay + a + ) 2 + log [erfcx(-\/ay + «+)] - log(2) + 



6c 



log <J 1 +sign( 7+ _)exp 

, 2 



-4+log(| 7+ _|)-log(2) + 



6c 



(B.25) 
(B.26) 

(B.27) 

(B.28) 
(B.29) 



log [ flgJ2| )] = - (y/ay + a-Y + log [erfcx( % /ay + a_)] - log(2) 

- log {1 + sign( 7+ _) exp [-a 2 _ + log(| 7+ _|) - log(2)] } 

log [ jBlfy ] = log (l - exp {log [erfcx(-^y - a_)] - log (2) - (^y + a_) 2 }) 

- log {1 + sign( 7+ _) exp [-a£ + log(| 7+ _|) - log(2)] } (B.30) 
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log [firm 



-- log(r 
■log 



log [(£.16 1] 



= log(l ■ 
log{l 



f-log(2)-- 
a 

-sign(7 + „)exp 
r) + log(2) 

-sign( 7+ _)exp [-a 2 _ + log(| 7+ _ |) - log(2)] } 



•log(|7+-|)-log(2) 



(B.31) 



(B.32) 



Now, for (22 1 and ( p3| , if w denotes the logarithm of the argument of erfcinv, one 
can compute erfcinvlog(w) := erfcinv [exp(u>)] using a standard implementation of 
erfcinv if w is not too small (the loss of precision using exp(w) instead of computing 
the full argument of erfcinv is negligible since the variation of erfcinv is very small 
even on logarithmic scale). However, even using 64 bit precision is not sufficient 
for the applications we address. Therefore, we use an asymptotic approximation of 
erfcinvlog(w) for w < —680 from pQ. 

An approximation of z — erfcinv [exp(ui)] for w — > — oo is given by: 

- log(7r) - log(-io) 
(-6-2) 

2/(9 -2w) 
1 



0,2 



03 



04 



32 V 



+ 6v - 6) 



1 

384 



(4v 3 



27v 2 

3/2 



108v 

5/2 



300) 

7/2 
04S ' 



a2S 3/2 + a 3 s 5 ^ + a 4S 7/2 (B.33) 

The discrepancy of this approximation to the implementation of erfcinv in Matlab 
is 2.34 • 10~ 12 for w = —690 and as it is an asymptotic formula, the error further 
decreases for w — > — oo. 
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